Natural Language Parsing as Statistical Pattern Recognition

Author

  • David M. Magerman
Abstract

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Preface

The debate between the relative merits of statistical modeling and linguistic theory in natural language processing has been raging since the days of Zellig Harris and his irreverent student, Noam Chomsky. I have never been shy about expressing my views on this issue. I would have liked nothing more than to declare in my dissertation that linguistics can be completely replaced by statistical analysis of corpora. In fact, I intended this to be my thesis: given a corpus parsed according to a consistent scheme, a statistical model can be trained, without the aid of a linguistics expert, to annotate new sentences with that same scheme.

The most important part of this thesis is that linguists need not participate in the development of the statistical parser. Using the most obvious representations of the annotations in the parsed corpus, the parser should automatically acquire disambiguation rules in the form of probability distributions on parsing decisions. In other words, natural language parsing would be transformed from the never-ending search for the perfect grammar into the simple task of annotating enough sentences to train rich statistical models. [1]

With the guidance and support of the statistical modeling gurus in the IBM Speech Recognition Group, I formulated and implemented a statistical parser based on this thesis. In experiments, it parsed a large test set (1473 sentences) with a significantly higher accuracy rate than a grammar-based parser developed by a highly-respected grammarian. The grammarian spent the better part of a decade perfecting his grammar to maximize its score on the crossing-brackets measure. [2] The grammarian's score …

[1] This task is not so simple if "enough sentences" turns out to be, say, 10 trillion, but I will deal with that issue later.
[2] For the definition of the crossing-brackets measure, see …

Similar resources

Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning

Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some recent developments in the area of grammar induction and language learning, as applied to vario...

Specifying Context free Grammar for Marathi Sentences

Marathi is an Indo-Aryan language and the official language of the state of Maharashtra. It is ranked as the 4th most spoken language in India and the 15th most spoken language in the world. Where computational linguistics is concerned, writing grammar productions for a language is somewhat difficult because of its different gender and number forms. This paper is an effort to write context free grammar for Mar...

Symbolic Parsing and Probabilistic Decision Making. The Speech and Language Experience with Hybrid Information Processing

In natural language technology, most projects up to now have been based either on logical and linguistic methods or strictly on stochastic techniques borrowed from pattern recognition. This article discusses hybrid symbolic and stochastic techniques in natural language processing as they are currently explored in many projects and in particular in our work within the Verbmobil p...

Language Understanding as Recognition and Transduction of Numerous Overlaid Patterns

Few computational linguists would dispute that language understanding crucially involves pattern recognition, in some sense, and transduction to an internal representation. However, theories differ on the sorts of patterns and processes they posit as underlying this recognition and transduction process. I would characterize the traditional view, dominant for several decades (and decidedly still...

The BBN Spoken Language System

We describe HARC, a system for speech understanding that integrates speech recognition techniques with natural language processing. The integrated system uses statistical pattern recognition to build a lattice of potential words in the input speech. This word lattice is passed to a unification parser to derive all possible associated syntactic structures for these words. The resulting parse str...

Journal:
  • CoRR

Volume: abs/cmp-lg/9405009

Publication date: 1994